Introduction to Deep Reinforcement Learning (DRL)

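Deep reinforcement learning (DRL) combines the high-dimensional representational power of deep neural networks with the optimal-control framework of reinforcement learning. Unlike supervised or unsupervised learning, a DRL agent learns by trial and error in a dynamic environment, making sequential decisions without immediate, explicit labels. This combination lets the agent process complex raw data (e.g., pixel input) directly.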

1. The DRL Learning Paradigm

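A reinforcement learning agent operates in a continuous loop: it observes the state of the environment ($S_t$), takes an action ($A_t$), and receives a scalar reward ($R_{t+1}$) that may be sparse or delayed. The central challenge is the credit assignment problem: determining which past actions were responsible for a later reward signal.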
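The observe-act-reward loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, and `random_policy` are hypothetical names invented for this example, and the environment is a toy corridor whose only reward sits at the far end, which makes the reward both sparse and delayed.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a 1-D corridor with a reward only at the far end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left;
        # the position is clipped to stay inside the corridor.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # sparse, delayed reward
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return a list of (S_t, A_t, R_{t+1}) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # choose A_t from S_t
        next_state, reward, done = env.step(action)  # receive S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    """Pick left (0) or right (1) uniformly at random, ignoring the state."""
    return random.choice([0, 1])


if __name__ == "__main__":
    episode = run_episode(CorridorEnv(), random_policy)
    total = sum(reward for _, _, reward in episode)
    print(f"steps: {len(episode)}, total reward: {total}")
```

Every step before the last one yields a reward of zero, so nothing in the trajectory says directly which of the earlier moves made the final reward possible. That is the credit assignment problem in miniature.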

2. The Optimization Objective

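The ultimate goal is to find an optimal strategy, or policy ($\pi^*$): a mapping from states to actions that maximizes the expected cumulative discounted reward ($G_t$). The discount factor ($\gamma \in [0, 1]$) is mathematically crucial because it sets the balance between the value of immediate rewards and the value of rewards received in the future.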

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
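For a finite episode, the sum above can be evaluated directly. The sketch below assumes the rewards are given as a plain Python list `[R_{t+1}, R_{t+2}, ...]`; the function name `discounted_return` is chosen for this example. It accumulates from the end of the episode using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which is equivalent to the sum.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


if __name__ == "__main__":
    rewards = [0.0, 0.0, 0.0, 1.0]  # a single reward, three steps in the future
    print(discounted_return(rewards, 0.9))  # approximately 0.9 ** 3 = 0.729
```

With $\gamma = 0.9$, a reward of 1 that arrives three steps later is worth roughly 0.729 from the current step's point of view.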

The Fundamental RL Cycle

An illustration of the Markov Decision Process (MDP) framework. The Agent's policy dictates the action ($A_t$) based on the current state ($S_t$), leading the Environment to transition to a new state ($S_{t+1}$) and provide a reward ($R_{t+1}$).

[Figure: The Reinforcement Learning Cycle, showing Agent, Environment, State, Action, and Reward]

Question 1

How does the DRL agent receive feedback from the environment?

Explicit labels/targets

Backpropagation through time

Scalar reward signal

Labeled demonstration data

Question 2

What does the policy ($\pi$) mathematically represent?

The predicted total reward

A distribution over actions given a state

The probability of transitioning to a new state

The error between predicted and actual returns

Challenge: The Discount Factor

Analyzing the Temporal Horizon.

Consider two scenarios:
1. $\gamma = 0$
2. $\gamma \approx 1$

Describe the agent's behavioral preference in each case regarding the timeline of rewards.

Step 1

How does the choice of $\gamma$ affect the policy's horizon?

Solution:
If $\gamma = 0$, the agent is myopic (shortsighted): only the immediate reward $R_{t+1}$ contributes to $G_t$, so later rewards are ignored entirely. If $\gamma \approx 1$, the agent is far-sighted: distant future rewards are weighted almost as heavily as immediate ones, which leads to planning over a very long horizon.
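A small numerical comparison makes the two cases concrete. The reward sequences and the helper names below (`immediate`, `delayed`, `preferred`) are made up for illustration: one option pays 1 right away, the other pays 10 after a five-step delay, and the agent is assumed to prefer whichever option has the larger discounted return.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] for a finite reward sequence."""
    return sum((gamma ** k) * reward for k, reward in enumerate(rewards))


# Hypothetical choices: a small reward now versus a larger reward five steps later.
immediate = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0]


def preferred(gamma):
    """Return the name of the option with the higher discounted return."""
    g_immediate = discounted_return(immediate, gamma)
    g_delayed = discounted_return(delayed, gamma)
    return "immediate" if g_immediate >= g_delayed else "delayed"


if __name__ == "__main__":
    for gamma in (0.0, 0.5, 0.99):
        print(gamma, preferred(gamma))
```

With $\gamma = 0$ the delayed reward is worth nothing and the agent takes the immediate payoff. With $\gamma = 0.99$ the delayed reward keeps about 95% of its value ($0.99^5 \approx 0.951$), so waiting is preferred.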